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Abstract 

The article presents an application of Hidden Markov Models (HMMs) for pattern recog¬ 
nition on genome sequences. We apply HMM for identifying genes encoding the Variant 
Surface Clycoprotein (VSC) in the genomes of Trypanosoma brucei (T. brucei) and other 
African trypanosomes. These are parasitic protozoa causative agents of sleeping sickness 
and several diseases in domestic and wild animals. These parasites have a peculiar strategy 
to evade the host’s immune system that consists in periodically changing their predominant 
cellular surface protein (VSG). The motivation for using patterns recognition methods to 
identify these genes, instead of traditional homology based ones, is that the levels of se¬ 
quence identity (amino acid and DNA sequence) amongst these genes is often below of what 
is considered reliable in these methods. Among pattern recognition approaches, HMM are 
particularly suitable to tackle this problem because they can handle more naturally the 
determination of gene edges. We evaluate the performance of the model using different 
number of states in the Markov model, as well as several performance metrics. The model 
is applied using public genomic data. Our empirical results show that the VSG genes on T. 
brucei can be safely identified (high sensitivity and low rate of false positives) using HMM. 

Keywords: Hidden Markov Model, Glassification, Gene Sequence Glassification, Try¬ 

panosoma brucei, Variant Surface Glycoprotein 
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1. Introduction 


Nowadays, the community disposes of a huge amount of genomic sequential data. Machine 
learning techniques are indispensable in the analysis of sequences due to the enormous size 
of the available genome databases. In this article we apply Hidden Markov Models (HMMs) 
for identifying a specific pattern in a hyper-variable gene sequences. Markov models have 
been widely applied for modelling several complex systems that evolve in time. In particular 
the HMM have proven to be a powerful tool for classification tasks on time series data and 
spatial sequential data. During the last 20 years, the model has been used for analysing 
biological sequences, and for solving problems like gene finding and protein analysis. We 
apply a HMM for identifying genes encoding the Variant Surface Glycoprotein (VSG) in the 
genome of Trypanosoma brucei (T. brucei). This approach can be naturally be extended 
to other African trypanosomes that also have this type of of hyper-variable genes. These 
are parasitic protozoa causative agent of sleeping sickness in humans and several diseases 
in domestic and wild animals. These parasites have a peculiar strategy to evade the host’s 
immune system that consists in periodically changing their predominant cellular surface 
protein (VSG). 

There are several methods that have been applied for detecting coding regions in a 
genome, among which the m ost used are those based on sequence comparison and ab initio 


methods ( Wang et ah . 20041 ) . Homology gene finding methods are based in given a DNA 
sequence identifying its homologous sequences in other genomes stored in databases. In ab 
initio gene prediction the goal is finding either the nearby presence or statistical properties 
of the coding region i.e. use gene structure as a template to detect genes. Although, 
for the particular case of the VSG genes these techniques sometimes do not yield reliable 
results, due to the high variability of the VSG genes. It means that, the levels of sequence 
identity (amino acid and DNA sequence) amongst the VSG genes is often below of what is 
considered reliable in the traditional homology based methods. Therefore in this article, we 
study the possibility of locating these genes looking for a statistical signature using HMM. 
In the following, we begin by specifying the motivations for studying this particular genoma. 
Next, we clarify the contributions and goals of this article. 


1.1 Motivation 

The African Trypanosomiasis also called sleeping sickness is caused by T. brucei para¬ 
site. There are two subspecies that causes disease in humans: T. brucei gambiense and T. 
brucei rhodesiense. A third sub-species named T. brucei brucei exists although it rarely 
infects humans. Around t he 90% of the sleeping sickness is caused by T. brucei gam¬ 
biense ( Organization . 200(tI ). The disease infections occur in regions of the sub-Saharan 
Africa. The disease is propagated by the bite of an infected tsetse fly that acts as disease 
vector. Even though over the last years t he number of cases have decreased, in 2009 were 
reported 9878 cases ( Organization . 2006h . This decreasing tendency has continued, and it 
has been reported 7216 cases in 2012. 

Gell surface of the trypanosoma consists of a uniform layer of Variant Surface Glycopro¬ 
tein (VSG) that blocks the trypanosomes recognition by the immune system of the mammal 
host. Once the individual is infected, the trypanosoma expresses a particular VSG and the 
host immune system reacts by generating a response to this protein layer, which causes a 
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decrease in the number of trypanosomes. Some trypanosomes can evade the recognition by 
the immune system. By expressing an alternative VSG for which have not yet developed 
antibodies. This allows increasing the population of infectious agents sustaining a long term 
infection, thus increasing the chances of transmission. Therefore, various infections occur 
each of which is caused by cells expressing a different VSG variant. 


1.2 Objectives and Contributions 


HMM is a probabilistic model used for the analysis of sequential data. There is a general 
consensus about the power of HMM for classifying time-series data and sequential data. The 
goal is developing a machine learning tool based on HMM that identifies homogeneous zones 
with the VSG genes in the genome of T. brucei. Among pattern recognition approaches, 
HMM are particularly suitable to tackle this problem because they can handle more naturally 
the determination of gene edges. As far as our knowledge is concerned, this article presents 
the first mathematical approach based on HMM for modelling such sequence. 

The main contribution is proposing an algorithm for classifying zones of a genome on two 
classes, which is based on the following three features: gene location, correlation between 
symbols in nucleotide sequence and its hidden information included in that sequence. The 
proposed approach is an hybrid procedure that combines the gene location and the power 
of the HMM technique for modelling the underlying structure of the sequenc e. We analyse 


the p erformance of our procedure on a public data from GenBank database (jBenson et al. 
2 OO 9 I ) . The performance of the algorithm is evaluated using different measures from the 


confusion matrix. Despite its simplicity, according to our experimental results the algorithm 
is robust and performant in accuracy (high sensitivity and low rate of false positives) and 
time for detecting VSG genes on T. brucei. 


1.3 Organization 

This article is organized as follows. Next section is a general introduction to HMMs and its 
applications. Section [3] introduces the algorithms used for estimating the parameters of the 
model. In 13.11 we present the Baum-Welch Algorithm, and in 13.21 we introduce the Viterbi 
algorithm. We present the methodology used for solving our problem in|Tl The experimental 
results are presented inO Finally, we conclude the article and discuss future research lines. 


2. Hidden Markov Model Description 

In this Section, we describe in brief the Hidden Markov Model (HMM). We discuss the 
general assumptions and properties of a HMM, and we present the approach to use it on the 
gene identification problem. The end of this section presents some related works that have 
used HMM for classification on sequential data. 

2.1 General Introduction 

Markov models are one of the most powerful tools for stochastic modelling of dynamic sys¬ 
tems. The theory of Markov is rich and quite complete, and the associated algorithmic tools 
are very efficient. The Markov models are characterised for their high ability to capture all 
kinds of behaviors. The model provides time-scale invariability in recognition and learning 
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capabilities (|Yamato et al.l . Il992l i . A HMM is a particular time-discrete Markov model where 
the states are not dire c tly visible. The theory o f HMM was developed by Baum in th e late 
60s ( Baum and Eaeon . 1967 : Baum et ah . 197[ll l. and in 1989, Rabiner ( Rabinei . 1989l i pub¬ 
lishes a tutorial describing Baum theory applied to speech recognition that is a reference in 
the study of such model s. The fami l y of HMMs has also shown to be effective for analysing bi¬ 
ological se quence data (lYoorj. 120091 : iDurbin et al.l . 119981 1 . The applicatio ns include: sequence 
alignment (Pachter et al.l . 12002 1. gene and protein structnre pr edictions (Munch and Krogh . 


20061 : I Won et al.l . 120071 1. modelling DNA sequencing errors (iLottaz et al.l . 120031) . and for 


analysing RNA structure (jYoon and Vaidvanathanl. I200N: iHymanci et al.l . 120071 1. A HMM 


is composed by the following elements (jRabiner and H.l . 119861 1: a finite set of states, a finite 


alphabet, and probabilities of state transition and symbol emission. At each time t a new 
state is entered in the system following a transition probability distribution that verifies the 
Markovian property, after each transition an output symbol is produced according to some 
probability distribution that depends on the current state. 

Let A be a finite alphabet and £ be a finite set of states or labels. Given a sequence 
of observations x = (xi, X 2 , ■ ■ ■, x^), Xj G A and a collection of states y = {yi,y 2 , ■ ■ ■, yn), 
yj G £ for all j, where each Xj has been emitted by its state yj. A HMM describes how the 
sequence x and its associated state y progresses in time. Note that, the index j (j = 1,..., n) 
can be thought of as a time index. We say that the states are hidden due to the sequence 
of observations is known but its related states are unknown. The model works as follows: 
it starts in an initial state chosen according to some probability, it emits an observation, 
moves to a new state emits a new observation and so on until to reach some final conditions. 

The model parameters are the probability of emitting symbols and the probabilities of 
the state transitions. Let P(xi|yi = k) be the emission probability that the symbol Xi is 
generated when the sequence is at the state k at time i. The probability of transition from 
the state k to state I is given by P(7/j+i = l\yi = A:), i.e. the conditional probability that the 
sequence is in the state I at time z -|- 1 when at time i was in the state k. For simplicity 
the above probabilities are respectively written by P (yj_|_i|yj) and P (xj|z/j). At n steps the 
model generates a sequence x = (xi, X 2 , ■ ■ ■, x„) for which passes through a sequence of 
states y = {yi,y 2 , ■ ■ ■ ,yn)- For a fixed length n the model defines a probability distribution 
over all sequences of observations x and all possible states y. Then, the probability that the 
model transits the path y and generates the sequence x is given by 


n 

i=l 

where the first transition P(yi|z/o) is given by P(z/i) = P(yi|z/o)- Figure [T] illustrates the 
relationship among states and observations in a HMM. We can see the conditional indepen¬ 
dence among the events in the system. The probability of the state z/j only depends on the 
previous state yi-i, and the observation generated at time i only depends on the state at 
time yi- 

In this article, we denote the unknown model parameters using greek letters, the matrix 
of transition probabilities by g, the emission matrix by s, and the collection of the whole 
parameter of the model by 6. We use bold fonts for matrices and vectors and normal 
fonts for scalars. Those parameters are estimated in a training step by a particular case of 
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Figure 1: Usual representation of a HMM. Figure illustrates the conditional independence 
relationship among the states and the observations in the system. 


the Expectation M aximization algorithm named the Baum-Welch algorithm (jWelchl . [2003|; 


Hastie et al.l . l200ll ). The model is based in the following assumptions: 


• Markov assumption: the transition probability among the states is dehned by P(?/j+i |?/j). 
This means that the next state depends only on the current state that is called Markov 
assumption of first order. 


• Homogeneity: it is assumed that the state transition probabilities are homogeneous, 
i.e. they are independent of its location in the sequence. As a consequence, the 
probability of moving from one state to the next state is independent on the position 
of the sequence in which this transition occurs (it is independent on time). Besides, it 
is assumed that the emission probabilities are also homogeneous. 


• Conditional independence among the observations: each observation Xi only depends 
on the state yi, that is Xj is generated by the state 7/j with P (xi|yj). 


2.2 Applying HMM in Biological Contexts 

A DNA is a macromolecule composed of simple units called nucleotides. Each nucleotide 
is composed of deoxyribose, phosphate and one of the four nitrogenous bases: adenine 
(A), cytosine (C), guanine (G), thymine (T). The DNA is represented by a double helical 
structure wherein each complementary base pair [A-T, C-G) are connected by hydrogen 
bonds. The two DNA strands are in opposite directions to each other, positive or 5' — 3' 
(left to right) is complemented by the reverse or complementary sequence. The selection of 
a HMM is given by the alphabet, the number of states and the pattern of transitions of the 
Markov chain. For this problem, the alphabet is defined as A = {A, C, G,T}. The number of 
states depends on the problem, for example can be £ = {gene, intergenic region} if the goal is 
determining genes and intergenic regiones, and can be £ = {exon, intron, intergenic region} 
if also there is interest in identifying non-coding (introns) regions. Although, the hidden 
space can be easily generalised for more states in the case that we are interested in identifying 
a higher number of regions in the sequence. The emission and transitions probabilities are 
numerically computed. 

More formal, given an alphabet A, a set of states £ and a DNA sequence x composed 
by xi, X 2 ,..., x„ where Xj € A, the task is to assign to each symbol Xj for j = 1,..., n one 
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label Hi in £. In the specific problem of VSG-gene identification, we define the alphabet 
as = {A,C,G,T}, and we analyse the HMM model for two situations: considering two 
states (£ = {VSG-gene, no VSG-gene}) and considering three states 

(£ = {VSG-gene, no VSG -gene,other st r uctur es}). The sequence x was taken from the 


public GenBank database ( Benson et al.l . 2009f l. 


2.3 Related Works 

HMMs have been applied in several areas of pattern recognition due to their powerful for 
modelling sequential data. In part i cular , they have been especially used for developing 
speech recognition devices (Rabiner, 1989l i. Basically, this problem consists in predicting a 
spoken word from a speech signal and translate into text. For example, hybrid techniques 
that use ideas from the Neural Netwo r k dom ain and HMMs has been applied for speech 
recognition systems ( Trentin and G(M . 20011 ) . Ano ther tool used for solving this problem 
has been Gaussian Mixture HMM ( Rabiner . 1989l i. In the case of phone spe ech recogni' 


tion a hybrid model that employs deep learning with HMMs was studied in (iDahl et al 
20 111 ) . H MM has been also applied for acoustic nrqdelin g using Mixture of Gaussian Dis¬ 


tributions (jjuang et al.l . Il986l : iRabiner and Schafeil . |2007P . In a ddition, the model wa s suc¬ 


cessfully applied in the Human-Gomputer Interface area. In (lYamato et al.l . Il992l ) time 


sequential images expressing h uman actions were en coded in a sequence of symbols, which 
were modeled by a HMM. In (iLiu and Wand . l201ll ) the authors use the sequence of facial 
temperature for emotion recognition, the transitions between emotional states are modelled 
by HMM. 

S ince the late 80s, HM M has been successf ully applied in the area of computational biol¬ 
ogy ( Durbin et ah . 19981 ). In ( Ghurchill . 19891 ) the author evaluated HMM for modeling five 
DNA sequences: the yeast mitochondrial, human and mouse mitochondrial, a human X chro¬ 
mosomal fragment, and the b acteriophage lambd a gen ome. Other applic ation s in this first 
stage have been presented in (iKrogh et al.l . Il994l i and (jBaldi et al.l . Il994l l . In (jKrogh et al 


I 994 I ) the authors applied HMM for searching common patterns between a protein sequence 
and multiply aligned sequences. They s hown the effectiven ess of the model for searching a 


sequence in a given protein family. In (jBaldi et al.l . [199^ the HMM was used to capture 


statistical characteristics in three protein families: globins, immunoglobulins, and kinases, 
wherein the model was applied for classification, multiple alignment s, and motif detec tion. 
Another protein structure modeling us ing HMM was presente d in ( Stultz et ah . 19931 ). A 
classification problem was studied in ( Henderson et al. . 19961 ) where the authors applied 
HMM for segmenting uncharacterised genomic DNA sequences i nto exons, in t rons and in- 
tergenic regions. The DNA sequencing errors were modeling in ( Lottaz et al. . 2003h where 
the authors improved the detection of translation start and stop sites, and thus the location 
of the coding regions. 

Other applications correspond to the task of search ing structures on genomes. Th e 
model was used for gene se arching in bacterial genomes (jPukashin and Borodovs^ . 19981 ). 


and in (Delcher et al. 


1999h was shown the effectiveness of HMM for finding gene on eukary¬ 


otic genom es. The model w as also used for identifying homologous protein and nucleotide 
sequences (jFinn et al.l . 120111 ). A variation of HMM named Profi l e-HMM have been applie d 


for analysing genomic sequences ( Eddv et ah . 1995 : Eddv . 19961 . 19981 : Gough et ah . 2001 1. 
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The huge interest in the community concerning HMMs for solving biological computational 
problems has caused that several re searchers develop specific s oftware for gene sequence 
and prediction s^nals, these in clude ( Lukashin and Borodovskv . 19981 : Delcher et ah . 1999 : 


Finn et ah. 2011 


Flickekl . 120071 ). 


Other learning techniques have also been used for so lving classification tas ks and clus 


tering problems in the field of computational biology. In (jPecaprio et al.l . 120071 1 the authors 


described a gene predictor based on Conditional Random Fields (CRF) that incorporates 
inf ormation of Exyres sed Sequence Tags (ESTs) for predicting genes on fungal genome. 


In (jGross et al.l . 120071 ) the authors combined CRF with Support Vector Machines (SVM) 


and tested for the human genome, the model incorporates data from other genomes called 
informants, making a multiple alignment based on some assumptions about the respective 
evolutionary processes. Another applicat i on of the HMM combined with SVM analyses the 


genome of a nematode ( Schweikert et ah . 20091 ) 


In biological sequen ce analysis Artificial Neural Network (ANN) have been also used. 
In ( Rebello et al.l . 2011 ) ANN was applied for predicting genes on a Streptococcus genome , 


and they were also used on the prediction of splice sites of a gene in ( Johansen et ah . 20091) . 


Decis ion trees were performed for finding genes in vertebrate DNA sequences (|Salzberg et al 
199fil ). and a variation (called Randomized oblique decision trees) com bined with other learn¬ 


ing techniques were applied for gene modeling in (I Allen et al.l . l2004l ) . 


In the particular case of T. brucei, an evolutionar y anal ysis considering the family mem¬ 


bers of this gene was introduced in ([Alvarez et al.l . 119961 ). Although, in this article the 


authors analyse the sequence doing pairwise comparisons between T. brucei and other pro¬ 
tein coding genes. An example of using a syste m based on HMM to search genes on a 


200 , 11 ) . although in this 


particular T. brucei chromosoma is presented in (|El-saved et al 
article the sequence has only one VSG cluster. 

During the last decades, HMM have been used in a large number of applications for 
analysing sequential data. Here, we presented a summary that is not exhaustiv e, a more com¬ 


plete overview about HMM applicati ons in biological sequences can be seen in ([Durbin et al 


1998: 

Ghoo et ah. 

2004; 

Yoon. 


3. Training and Inference 

Let 6 be the vector with the parameters of the model (the probabilities of transitions and 
the probabilities of emissions). The training process is based in the Maximum-likelihood 
Estimation (MLE) that consists in adjusting parameters to maximize the likelihood for a 
given sequence x generated by the model. That is finding 6 such that P 0 (x) is maximised. 
For any two states k and I, and a symbol b we find the parameters pki = P(yi+i = l\yi = A;) 
and ek{b) = P(xi = b\yi = k) as follows: 

. _ Pkl . ,, X _ Ek{h) 

Pkl 1 ^ k \ P ) V J-, /7/\ 7 

Hl'PkV Hb'PkW) 

where Pki is the number of transitions from k to I, and Efc( 6 ) is the number of times that 
the state k emits the symbol b in the current training sequence. The adjustable vector 
parameter 0 is a collection composed by p^i and for all k and 1. 
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A co mputat i onal e fficient algorithm for setting the parameter is Baum-Welch (BM) Al- 
gorith m (IWelchl. l2003ll . that is a particular case of the well-known Ex pectation-Maximization 


(EM) (jBaum et al.l . Il970l : I Dempster et al.l . 119771 : iHastie et al.l . l200lh . The BW algorithm is 
based in two auxiliary recursive procedures called forward and backward algorithms. The 
forward algorithm is a dynamic programming algorithm that com putes i n an efficient way 
the forward probability, which is given by the following expression ( Yoon . 2009t l 


ak{n) = F{xi,...,Xn,yn = k) 

= P{xn\yn = k) ^aj(n - l)P(yn = k\yn-i = j). 

3 

The time-complexity of the forward algorithm is nAT^ where K is the number of states and 
n is the size of the sequence x. Algorithm [1] presents the procedure used for compute the 
forward probabilities ak{n) for all states k. 


Algorithm 1: Forward Algorithm. 


// Initialization 


1 afc(l) = ^{xi,yi =k) = P(yi = k)F{xi\yi = k), Vk = 1,... K] 

2 for (7 = 1 to n) do 


3 


afc(i) = F[xi\yi = k) '^ajii - l)F[yi = k\yi_i = j)] 
3 


4 P(x) = ^afc(n); 

k 

5 Return P(x) and ak{n) for all fc; 


The backward probability is also recursively defined 
/Ifc(n) = F{Xn+l,. . . ,Xn\yn = k) 


^p(xn-ri|yn-ri = l)/3i{n + l)P(?/„+i = l\yn = k). 


Algorithm [2] illustrates the procedure used for compute the backward probabilities ftk (n) 
for all states k. In a similar way that the forwa rd case, this backward probability can be 
recursively computed in a nK^ time-complexity ( Yoon . I2nn9li . 


3.1 Description of the Baum-Welch Algorithm 

The BW algorithm is iterative, at the first iteration is initialised the transition probability 
matrix g and the emission matrix s. The initialisation can be done using a uniform dis¬ 
tribution or in an arbitrary way based on a priori known information. In the following we 
present in brief the main mathematical expressions of the Baum-Welch method. Let x be 
the current training sequence, the transi t ion p robability from the state k to the state I at 
the position z of x is computed as dYoonl. I20n9h 


= k,yi+i =l\x) = 


afc(z)P(l/i+i = l\yi = k)F{xi+i\yi+i = l) I3i{i -\- 1 ) 

P (x) 
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Algorithm 2: Backward Algorithm. 


// Initialization 

1 /3fc(n) = lVfc = l,...A; 

2 for (i = n — 1 to 1 j do 

3 


Pkii) = ^P(xi+i|yi+i = + l)P(?/i+i = l\yi = /c); 

i 

4 P(x) = ^P(yi|?/o = k)F{xi\yi = k) f5k{l); 

k 

5 Return P(x) and /3fc(n) for all k] 


where P (x) = ''^^ak{n) is computed using the forward algorithm ( Rabiner . 1989). The 

k 

probability of generating the symbol b when the state is k is computed as 

Y.{r.Xi=b} 


P(x) 


( 2 ) 


Let T be the number of sequences in the training set, and we denote the training sequences 
by for t = 1,... ,T. The probabilities of transition and emission are computed as 


Pkl = 


Pkl 


Ylv Pkl' 


and ek(b) = 


Ek(b) 


where 


Eb'Ek(bO^ 


P(fw) Y = Ayi = k)F{xfl^\yi+i = 1) ^f\i + 1 ) 




( 3 ) 


Ek{h) = Y. 




P(xW) 


( 4 ) 




It has been proven in (jHastie et al.l . l200lh that the likelihood function increases at each 
step of the BW algorithm. That is Pq^h+i) {x) > Pgih) {x) for all step h. The algorithm ends 
when the difference between the likelihood function at the step h + 1 and the value at the 
step h is lower than some arbitrary threshold, that is dist{P^(h+x){x)^Pg(h){x)) < e where 
dist{-) is an arbitrary distance. In our numerical experiences, we used Euclidean distance 
and e = 0.00001, in addition we control the number of iterations (the maximum number of 
iterations was 500). 


3.2 Description of the Viterbi Algorithm 

Once the model parameters are estimated, the Viterbi algorithm is useful for finding the 
best sequence of states y for generating a specific sequence x. The method assumes that the 
best sequence is that that generates the sequence x with higher probability, that is 


y 


* 


= argmax 
y 


P{x,y) 

P(x) 


( 5 ) 
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Note that the denominator in ([5]) does not depend of y. So, the goal is to hnd y* such that 
P(x, y) is maximised. Viterbi algorithm is iterative. Given the samples until the position 
i, let Vi{k) be the probability of the best sequence of states (path) that generates the state 
k. At the initial iteration, the method sets Vi{k) with the probability that observation 
Xi is generated. Next, Vi(l) is computed for each position i = 2, ...,n and each state 
k = 1,... ,K. At the step i, the algorithm finds the sequence of states yi,y 2 , ■ ■ ■ ,yi that 
generates the sequence Xi,X 2 , ■ ■ ■ ,Xi with highest probability, for each location i and state 
k {yi = k). The algorithm computes the for the location i + 1 and all possible states 

I following 

Vi+i{l) = max ^Vi(/c)P(yj+i = l\yi = k)^¥{xi+i\yi+i = l). (6) 

We define an auxiliary variable Ij(/) that contains the state that maximises the expression ([6]) 
at each iteration. The probability ¥(x,y*) is given by the maximum value of Vn{k) for all 
the states k. The most probable path y* is computed using a backtracking procedure from 
(n till 1) and y*_^ = Ii{y*). 


3.3 Technical Issues 

We present here some technical issues related to the performance and numeric stability of 
the algorithms during the training process 


Algorithmic stability: The product among probabilities should be controlled, because 
it can occur underflow. This is a common problem when we are working with power 
of Markov matrices and products of probabilities. The stability can be improved 
using scaling coefficients in the forward and backward probabilities and log arithmic 


transformations in the Viterbi algorithm ( Durbin et ah . 1998 : Rabiner . 1989l l 


Performance of learning on imbalanced dataset: we say t hat a dataset i s imb alanced 
if the classes are not approximately equally represented ( Chawla et ah . 2002l l. This 
imbalance can provoke suboptimal classification perf ormance, due t o the learning tool 


overweight the large classes ignoring the small ones (IChawla et al.l. I20n2h . Some tech 


niques have been introduced for improving the predictors in the imbal anced context 


which can be performed for solving the gene classification problem (jChawla et al. 
2nn4h . 


4. Methodology 

In this section we describe the methodology used for applying HMM for solving the problem 
of the labelling areas of T. hrucei genome as VSG gene or non-VSG gene. The alphabet is 
composed by the four nucleotide types A = {A, C, G, T}. Concerning the number of states 
in the problem, we begin by adjusting the model with two states, which represents a binary 
situation: yes or not VSG region. Let S be the non-redundant collection of sequences of 
T. brucei genome. The set S has 9 sequences of different lengths and different numbers of 
clusters of VSG genes. As usual, we generate a random partition of the set S in two subsets: 
a training set and a validation set . The sequences in are used for estimating the 
transition probability matrices and the parameters of the emission distribution. We apply 


10 

































BW Algorithm for performing that estimation. Note that, only the nucleotide sequence is 
taken account in the BW Algorithm, the information about the VSG genes is not used. 
Next, we apply the Viterbi Algorithm on the sequences in in order to estimate the most 
likely sequence of hidden states. We determine those regions in the testing sequences that 
are predicted as VSG genes. Finally we evaluate the model using the predicted values with 
the VSG annotated genes of the sequences in . 

In our experiments we find that the predictor can be improved defining an interval of 
control, which we call location control filter. We can define a range where the VSG are 
presented, which is based on the minimum and maximum lengths of the labeled VSG. Once 
this range is defined, we can use it as filter for the predictions given by the BW and Viterbi 
algorithms. The model assigns a label VSG to a sub-sequence of nucleotide if verifies the 
following two conditions: it is predicted by the HMM model such as VSG and it passes the 
filtering operation. We call filter operation if the subsequence has location inside the range 
defined by the training set. As a consequence, it is possible to adjust the predictions given 
by the model discarding those that do not pass the hlter. Obviously, the range depends 
of the training set, then it can change according different training data. In spite of that, 
this is a common practice in machine learning. For instance, when the input patterns are 
normalised. This normalisation more often takes as reference the maximum and minimum 
value of the input variables. Here, the approach is similar, we consider the minimum VSG 
location and the maximum VSG location using the training data, out of this range the genes 
are considered non-VSG. Figure [2] presents a hierarchical schema among the techniques that 
compose our approach. 

We evaluate the learning capability of the model using the following quantitative metrics. 
Table [T] shows the error types (confusion matrix) that can happen during the parameter 
estimations, four situations are identified: 


• True positive (TP) refers when the VSG was correctly classified. 

• False positive (FP) refers to an incorrectly identification, in our problem that occurs 
when the HMM predicts that some nucleotides are part of the VSG gene when this is 
not correct. 

• False negative (FN) refers when the model incorrectly rejected a VSG gene. 

• True negative (TN) refers when the model correctly rejected a VSG gene. 


We measure the quality of our estimation using two well-known measures of binary prediction 
quality: the True Positive Rate (TPR) (also named sensitivity) and the Positive Predietive 
Value (PPV) (also named precision). The TPR is the proportion of nucleotides correctly 
predicted over the total number of nucleotides that are part of VSG gene. It is defined as: 


TPR = 


TP 

TP+ FN' 


( 7 ) 


The PPV measures the proportion of correctly classify nucleotides over the total of VSG 
identifies. It is given by: 


PPV 


TP 

TP + FP’ 


( 8 ) 


11 





Figure 2: Schema of the procedure for gene sequence classification using HMM. A training 
set composed by sequences is the input of the Baum-Welch algorithm, then the Viterbi 
algorithm is applied producing a partial estimation. The training sequence is also used for 
generating an interval control corresponding the locations of the VSG genes. Next, the 
location interval of control is applied as filter on the partial estimation to produce the final 
estimation. 


Table 1: Confusion matrix: type of estimation errores. 


Prediction / Target 

Positive 

Negative 

Positive 

True Positive (TP) 

False Positive (FP) 

Negative 

False Negative (FN) 

True Negative (TN) 


The following hypothesis are considered. We assume that exists independence among 
sequences, as well as among genes into the sequences. That is, we dispose of a non-redundant 
collection of sequences of T. brucei genomic data. A common problem when we are working 
on pattern recognition for genome sequences is that we can not affirm the independence 
between genes and segment of genes in the sequence. Some parts in the sequence can be 
strongly correlated each of other. The correlation among several sequences arises from the 
process of phylogenetic inertia, basically this means that the sequences have the same origin. 
In spite of that, the process of divergence evolution can vanish the phylogenetic information. 
Even two sequences that diverged from a common ancestral sequence can absolutely lost 
the information from their ancestors if the sequences diverged enough. Often, it is used the 
percentage of identical base pair (sequence identity) between two sequences as correlation 
assessment between that sequences. As a consequence, in order to assume independence 
among sequences and to produce a non-redundant collection, the sequence percent identity 
is used as independence assessment reference. The VSG genes diverge from a common 


12 




























ancestor, despite that their amino acid identity is low. An usual assumption is considering 
the VSG genes as non-redundant set, and assuming the hypothesis of independence among 
the sequences. 


5. Results 

5.1 Data description 

We use a set of 9 genomic sequence segments from chromosome 9 of T. brucei genome 
where so me VSG gene cluste rs are located. The data set was taken from the GenBank 


database ( Benson et ah . 2009h . Table 2 shows size of sequences and number of annotated 


VSG genes for each one. Since in these genomic sequences the location of most VSG had been 
previously identified using traditional homology-based annotation methods, our predictions 
obtained with the HMM approach can be compared with these annotations obtained by 
traditional based methods. 


Table 2: Size of the sequence segments and number of annotated VSG genes in each sequence 
of chromosome 9 of T. brucei. 


Sequence Id 

Size 

Number of VSG genes 


75462 

16 

52 

67530 

21 

53 

19745 

7 

54 

89317 

10 

55 

102962 

3 

56 

17110 

5 

57 

88923 

8 

58 

24879 

0 

59 

65933 

7 


5.2 Applying the HMM model with 2 states 


We start by analysing the performance of the HMM model with 2 states. Without loss 
of generality we index the sequences in S from 1 to 9 as Si, S 2 , ■ ■ ■, Sg. We create three 
partitions of S using a uniform random selection. Each partition have two groups, one is 
used for training and another one for validations. Table [3] shows the training and validation 
sets for each partition of S. In case of HMM with 2 states the BW algorithm converges 
after 160 iterations for the three training sets. We compute the Viterbi path and the model 
accuracy. 

We reached the following estimated matrices when we used the sequences of the partition 

A 


f 0.9969 0.0031 \ 

V 0.0029 0.9971 ) ’ 


where the {i,j) element represents the transition probability from state i to j for i,j = 1,2 
and 
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Table 3: Generation of the training and validation sets. 


Partition Id 

Training set 

Validation Set 

A 

{ 51 , 52 ,^ 4 , 55 , 57 , 58 } 

{ 53 , 56 , 59 } 

B 

{ 51 , 52 , 53 , 54 , 56 , 59 } 

{ 55 , 57 , 58 } 

C 

{ 53 , 55 , 56 , 57 , 58 , 59 } 

{ 51 , 52 , 54 } 

i ( 

0.3815 0.1911 0.2018 

0.2255 \ 


0.1932 0.2213 0.2031 

0.3824 / 


where (i,l), i = 1,2 represents the probability of state i emits the symbol A, (i,2) the 
probability that state i emits the second symbol C and so on. The matrix of transition 
probabilities is diagonal dominant, therefore the chance that the model jumps from one 
states to another one is very low. 

Table m shows some metrics of the accuracy of our model. The rows of the table show the 
accuracy metric reached for the model with 2 states. The first column specify the metric, 
the next three columns shows the TPR, PPV, and number of VSG non-detected (ND-VSG) 
using the partition A. The other following columns present the results using B and C sets. 
The model predicts all VSG genes in almost all cases, therefore TPR is high. In the worst 
case the TPR is higher than 0.81. The PPV for the sequences Sg of the partition A and 

of the partition B are small, as a consequence the model prediction has a relatively high 
false positive error. 

Table 4: Sensitivity (TPR), precision (PPV) and number of non-detected VSG (ND-VSG) 
reached by a HMM with 2 states 



53 

A 

56 

59 

55 

B 

57 

58 

5i 

C 

52 

54 

TPR 

0.9709 

0.9634 

0.9838 

0.9581 

0.9485 

- 

0.9759 

0.8104 

0.9959 

PPV 

0.5214 

0.4652 

0.2781 

0.1769 

0.4775 

- 

0.3171 

0.4455 

0.3085 

ND-VSG 

0 

0 

0 

0 

0 

0 

8 

4 

0 


5.3 Application of the HMM with 3 states 

In order to improve the accuracy in predictions, we analyse a HMM with 3 states. Our 
purpose is to investigate the existence of another group that does not appear in the HMM 
with 2 states. We use training and validation set presented in Table [3j The BW algorithm 
converges for the three partitions in less than 400 iterations. 

The estimated matrix of transitions when was used the sequences of the partition A was, 

/ 0.9956 0.0028 0.0016 \ 

Q = 0.0021 0.9976 0.0003 , 

\ 0.0017 0.0003 0.9980 / 

where the {i, j) element represents the transition probability from state i to j for i, j = 1,2, 3. 
As well as in the case of a model with two states, that matrix is diagonal dominant. The 
estimated matrix of emissions when was used the sequences of the partition A was. 


14 












/ 0.3610 0.1455 0.1388 0.3547 \ 

£ = 0.1712 0.2346 0.2267 0.3675 , 

\ 0.3610 0.2330 0.2369 0.1691 / 

where the hrst column of the matrix corresponds to the base A, the second to the base 
C, the third one to the base G and the last one to the base T, so the (i, 1) represents the 
probability of state i emits the symbol A and so on, i = 1,2, 3. 

Table [5] presents the accuracy of the model with 3 states. The first three columns shows 
TPR, PPV and number of non-detected VSG genes for partition A. The next group of three 
columns present accuracy for partition B and the last columns corresponds to partition C. 
We can affirm that the model with 3 states fits better than the 2-state model. The TPR 
metric is slightly lower in this case, but the precision increases for all sequences. 


Table 5: Sensitivity (TPR), precision (PPV) and number of non-detected VSG (ND-VSG) 
reached by a HMM with 3 states. 



^3 

A 

^9 

^5 

B 

Sj 

^8 


C 

S2 

^4 

TPR 

0.9472 

0.9540 

0.9559 

0.9535 

0.9405 

- 

0.9559 

0.9280 

0.9403 

PPV 

0.6712 

0.7080 

0.4022 

0.3985 

0.6629 

- 

0.5027 

0.5435 

0.4557 

ND-VSG 

0 

0 

0 

0 

0 

0 

0 

0 

0 


Table [6] shows the locations of annotated VSG (left column) and predicted VSG (right 
column) by the HMM model with 3 states using the partition A and sequence 9. The 
model correctly predicts seven VSG genes and misclassihes other seven ones. The length 
of VSG genes correctly predicted in the sequence 9 varies from a minimum of 1115 up to 
2217. The VSG genes wrong predicted by the model have lengths: 113, 510, 653, 250, 
206, 5600 and 2432. Note that all the bad predictions are out of the range [1115,2217]. 
This situation is repeated in a large number of wrong predicted genes on the validation 
sequences. As a consequence, if we consider the lengths of the annotated VSG genes (in 
the training set) we can define a bounded range by the minimum and maximum length of 
well-predicted VSG genes. Then, it is possible to adjust the predictions provided by the 
model discarding those that are outside of this range. Therefore, we can use the training 
set for both: to set the parameters of the model and to define a range for the VSG length. 
Figure [3] illustrates how the model is improved when a location hlter is performed. Besides, 
it show s an additional chal lenge of the learning model due to the imbalance among the 
classes ( Ghawla et al.l . 2004 1. There are three histograms, the horizontal axe indicates the 
classes and the vertical axe indicates the percentage of genes. The data is the testing and 
corresponds to the sequence 9. There are two classes: the non-annotated VSG (denoted 
by 0) and the annotated VSG (denoted by 1). The left histogram shows the testing data, 
in the center is presented the estimations produced by the HMM without control of the 
gene locations, and the right histogram presents the estimations produced by the HMM 
with location control. The serie of histograms shows how is improved the percentage of 
well-classified genes for the sequence 9 using the location filter, the increment is larger than 
10 % of cases. 
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Figure 3: Example of the improvement for controlling the gene location. The data cor¬ 
responds to the sequence 9. Left histogram was generated with the original data, in the 
center the histogram was done with the estimations produced by the HMM, and the right 
histogram corresponds to the estimations with a location control. 


The VSG genes in the sequence 9 are located between the first 22000 nucleotides (Ta¬ 
ble [6|). In this region, the most likely path of states determined by the Viterbi algorithm 
shows that only states 1 and 3 are presented (the state 3 corresponds to the VSG genes), 
with only the exception between the coordinates 6277 — 6429. From 27560 until the end 
of the sequence, the changes are between states 1 and 2 (with only the exception between 
50143 and 52575 locations). This indicates that in addition to the VSG genes, the model 
detects other regions (corresponding to state 2) with particular characteristics that make 
them distinguishable from those predicted as states 1 and 3. We observe that those par¬ 
ticular regions located at the second middle of the sequence and relating to the state 2 
correspond to the VSG genes in the complementary sequence (Table [7]). Table [3 presents an 
example about the location of annotated (left column) and predicted (right column) VSG 
genes by the 3-states H MM for the comp l ementary sequence 9. Figu re IH was generated with 


the Artemis Software ( Rutherford et ah . 200d : Carver et ah . 2012h . It shows the location 


of annotated VSG genes and predicted by the 3-state model for sequence 9. The annotated 
VSG genes are indicated with the label “vsg” and the predictions are labeled as “pred”. The 
figure is useful for explain the model accuracy in finding the VSG on the positive strand 
(left to right direction, located at the top of the figure) and complementary strand (oppo¬ 
site direction, situated on the bottom) for the sequence 9. Table |8] shows the performance 
reached by the HMM with 3 states for the complementary sequence. The table presents the 
TPR and PPV values and the number of not detected VSG. The three groups of columns 
correspond to the A, B and C learning partitions. We can see that all VSG genes located 
on the complementary sequence are well-identified by the model. 
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Table 6: Location of the VSG genes annotates and predicted by the model with 3 states. 
The data corresponds to the sequence 9. 


Location of the VSG gene 

Location of the VSG prediction 

1001 - 2502 

658 - 2427 

3056 - 3169 

4223 - 5692 

3423 - 5640 

6846 - 7375 

6429 - 7544 

8875 - 10387 

8626 - 10315 

10829 - 11339 

12360 - 13819 

11555 - 13748 

14192 - 14845 

16342 - 17943 

16184 - 17849 

18148 - 18398 

18712 - 18918 

20343 - 21634 

19746 - 21585 

21676 - 27276 

50143 - 52575 


Entry: 0 CR9_cluster_VSG_3ecuencia9_. gbk 0 CR9_predlicciones_VSG_secuencia 9. gbk. 

□>□>[]> D> 13 3 3 

vsg vsg vs( vsg vsg vsg vsg 

StllK'B’Bin# MB’ 

pre pred pred re pred pred z( pred d :ed pred 

[esoo |l3000 |l9500 [26000 [32500 [39000 [45500 [52000 [58500 [65000 

<l <ign<itga<BMB daaiaBi €tM<iia<w 

pred pr j pred pred 1 pred pred pred d pred ;d pred i Dr pred cei pred pred 

ooooo oo oooo 

vsg vsg vsg vsg vsg vsg vsg vsg vsg vsg vsg 


Figure 4: Annotated VSG genes and their predictions by an HMM with 3 state s for the 
sequence 9. The Figure was generated using Artemis software ( Garver et ah . 2012I ). 


6. Conclusions and Future Work 

In this article we tackle a classification problem using a Hidden Markov Model (HMM). Our 
approach identified with relatively success a particular type of genes on Trypanosoma brucei 
(T. brucei) genome, those encoding Variant Surface Glycoproteins (VSG) from African try¬ 
panosomes. T. brucei uses the expression of different VSG proteins to evade host immune 
mechanisms provoking the Trypanosomiasis disease. The results shows that the VSG genes 
have characteristics that make them detectable as statistically homogeneous zones in T. bru¬ 
cei genome. We perform a HMM with 2 and 3 states. The model with 3 states outperforms 
the 2-states model, identifying VSG in the two opposite strand directions. The HMM with 
3 states is efficient in searching VSG genes, according our experiments in all cases the model 
reached a sensitivity over the 90% and the totality of VSG genes were well-identified. 
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Table 7: Location of the VSG gene and their predictions for the HMM with 3 states for the 
complementary sequence. The data corresponds to the sequence 9. 


Location of the VSG gene Location of the VSG prediction 
27500 - 29086 


30672 - 32290 

34509 - 35973 
37592 - 39046 
40462 - 41986 

44756 - 46342 

48819 - 50113 

53634 - 55059 
56172 - 57665 

59975 - 61544 

63341 - 64935 


27560 - 29217 
29404 - 29594 
30216 - 30348 
30752 - 32972 
33791 - 34145 
34619 - 36483 
37648 - 40083 
40528 - 42688 
43715 - 43783 
43853 - 44335 
44816 - 46881 
47692 - 48183 
48910 - 50143 
52782 - 53268 
53685 - 55339 
56223 - 57969 
58034 - 59358 
60042 - 62274 
62542 - 63076 
63466 - 65166 
65468 - 65933 


Table 8: Sensitivity (TPR), precision (PPV) and number of non-detected VSG (ND-VSG) 
reached by a HMM with 3 states for the complementary sequences. 



^3 

A 

Se 

^9 

^5 

B 

S7 

^8 


C 

S2 

^4 

TPR 

- 

- 

0.9508 

0.9485 

0.9568 

0.9540 

0.9226 

- 

0.9385 

PPV 

- 

- 

0.6157 

0.5568 

0.4833 

0.5422 

0.5233 

- 

0.6005 

ND-VSG 

0 

0 

0 

0 

0 

0 

0 

0 

0 


In addition, we propose a simple method for identifying the misclassified genes by the 
HMM model. We define a filter using the location of the VSG in the training data. Despite 
the simplicity of the approach presented in this article, according the empirical results for 
a particular genome the performance for gene detection was very high. This approach can 
be of great utility to apply to other African trypanosomes whose silent repertoires of VSG 
genes are not well-characterized experimentally as well as to others families of hyper-variable 
genes. 

The present work was restricted to detect VSG genes within regions that are already 
known to be VSG gene clusters. Therefore the problem was delimited to differentiate only 
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three states, VSG genes, VSG genes located on the complementary strand of DNA and 
intergenic regions. An interesting issue for future work is the following: it would be of great 
interest to distinguish also between genomic regions containing VSG clusters from other 
genomic regions containing regular protein coding genes (encoding for normal functions in 
the cell). 
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